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Abstract 

Three state-of-the-art statistical parsers are com- 
bined to produce more accurate parses, as well as 
new bounds on achievable Treebank parsing accu- 
racy. Two general approaches are presented and two 
combination techniques are described for each ap- 
proach. Both parametric and non-parametric mod- 
els are explored. The resulting parsers surpass the 
best previously published performance results for 
the Penn Treebank. 



1 Introduction 

The natural language processing community is in the 
strong position of having many available approaches 
to solving some of its most fundamental problems. 
The machine learning community has been in a simi- 
lar situation and has studied the combination of mul- 



tiple classifiers flWolpert, 1992| ; |Heath et al., 1996| ) . 
Their theoretical finding is simply stated: classifica- 
tion error rate decreases toward the noise rate ex- 
ponentially in the number of independent, accurate 
classifiers. The theory has also been validated em- 
pirically. 

Recently, combination techniques have been in- 
vestigat ed for part of speech taggi n g with positive 



resul ts ( van Halteren et al., 1998| ; Brill and Wu 
1998] ). In both cases the investigators were able to 



achieve significant improvements over the previous 
best tagging results. Similar advances have been 



made in ma chine translation ( Frcdorking and Niren-. 
burg, 1994 ), speech recog nition flFiscus* 19971) and 



named entity recognition ( Borthwick et al., 199S ). 

The corpus-based statistical parsing community 
has many fast and accurate automat ed parsing sys - 
tems, including s ystem s produced by Collins (1997 ) , 
Charniak (1997|) and |Ratnaparkhi (1997|) . These 



three parsers have given the best reported parsing 
results on the Penn Treeba nk Wall Street Journal 
corpus ( Marcus et al., 1992 ). We used these three 



parsers to explore parser combination techniques. 



2 Techniques for Combining Parsers 
2.1 Parse Hybridization 

We are interested in combining the substructures of 
the input parses to produce a better parse. We 
call this approach parse hybridization. The sub- 
structures that are unanimously hypothesized by the 
parsers should be preserved after combination, and 
the combination technique should not foolishly cre- 
ate substructures for which there is no supporting 
evidence. These two principles guide experimenta- 
tion in this framework, and together with the evalu- 
ation measures help us decide which specific type of 
substructure to combine. 

The precision and recall measures (described in 
more detail in Section used in evaluating Tree- 
bank parsing treat each constituent as a separate 
entity, a minimal unit of correctness. Since our goal 
is to perform well under these measures we will simi- 
larly treat constituents as the minimal substructures 
for combination. 

2.1.1 Constituent Voting 

One hybridization strategy is to let the parsers vote 
on constituents' membership in the hypothesized 
set. If enough parsers suggest that a particular con- 
stituent belongs in the parse, we include it. We call 
this technique constituent voting. We include a con- 
stituent in our hypothesized parse if it appears in the 
output of a majority of the parsers. In our particular 
case the majority requires the agreement of only two 
parsers because we have only three. This technique 
has the advantage of requiring no training, but it 
has the disadvantage of treating all parsers equally 
even though they may have differing accuracies or 
may specialize in modeling different phenomena. 

2.1.2 Naive Bayes 

Another technique for parse hybridization is to use 
a naive Bayes classifier to determine which con- 
stituents to include in the parse. The development of 
a naive Bayes classifier involves learning how much 
each parser should be trusted for the decisions it 
makes. Our original hope in combining these parsers 
is that their errors are independently distributed. 
This is equivalent to the assumption used in proba- 



bility estimation for naive Bayes classifiers, namely 
that the attribute values are conditionally indepen- 
dent when the target value is given. For this reason, 
naive Bayes classifiers are well-matched to this prob- 
lem. 

In Equations |l| through || we develop the model 
for constructing our parse using naive Bayes classi- 
fication. C is the union of the sets of constituents 
suggested by the parsers. n(c) is a binary function 
returning t (for true) precisely when the constituent 
c e C should be included in the hypothesis. Mi(c) 
is a binary function returning t when parser i (from 
among the k parsers) suggests constituent c should 
be in the parse. The hypothesized parse is then the 
set of constituents that are likely (P > 0.5) to be in 
the parse according to this model. 



argmaxP(7r(c)|Afi(c) . . . M k (c)) 

7r(c) 

P(M 1 (c)...M k (c)\n{c))P{n(c)) 
~ ar fS aX P(M 1 (c)...M fc (c)) 



(1) 



= argmaxP(7r(c)) 



P(M 1 (c)|tt(c)) 



t(c) 



f = \ P(Mi(c)) 



(2) 



= argmaxP(7r(c))n P{Mi{c)\n(c)) (3) 



t(c) 



The estimation of the probabilities in the model is 
carried out as shown in Equation ||. Here N(-) 
counts the number of hypothesized constituents in 
the development set that match the binary predi- 
cate specified as an argument. 



P(tt( C ) =t)'[[P(M i (c)\w(c) = t) 

i=l 

N(n(c)=t)J^ N(M i (c),ir(c)=t) (4) 



i=l 



N(tt(c) = t) 



2.1.3 Lemma: No Crossing Brackets 

Under certain conditions the constituent voting and 
naive Bayes constituent combination techniques are 
guaranteed to produce sets of constituents with no 
crossing brackets. There are simply not enough 
votes remaining to allow any of the crossing struc- 
tures to enter the hypothesized constituent set. 

Lemma: If the number of votes required by con- 
stituent voting is greater than half of the parsers 
under consideration the resulting structure has no 
crossing constituents. 

Proof: Assume a pair of crossing constituents ap- 
pears in the output of the constituent voting tech- 
nique using k parsers. Call the crossing constituents 
A and B. A receives a votes, and B receives b votes. 
Each of the constituents must have received at least 
l^rl votes from the k parsers, so a > and 



b > ■ Let s = a + b. None of the parsers pro- 

duce parses with crossing brackets, so none of them 
votes for both of the assumed constituents. Hence, 
s < k. But by addition of the votes on the two 
parses, s > 2|~^±i] > k, a contradiction. ■ 
Similarly, when the naive Bayes classifier is con- 
figured such that the constituents require estimated 
probabilities strictly larger than 0.5 to be accepted, 
there is not enough probability mass remaining on 
crossing brackets for them to be included in the hy- 
pothesis. 

2.2 Parser Switching 

In general, the lemma of the previous section does 
not ensure that all the productions in the combined 
parse are found in the grammars of the member 
parsers. There is a guarantee of no crossing brackets 
but there is no guarantee that a constituent in the 
tree has the same children as it had in any of the 
three original parses. One can trivially create sit- 
uations in which strictly binary-branching trees are 
combined to create a tree with only the root node 
and the terminal nodes, a completely flat structure. 

This drastic tree manipulation is not appropriate 
for situations in which we want to assign particu- 
lar structures to sentences. For example, we may 
have semantic information (e.g. database query op- 
erations) associated with the productions in a gram- 
mar. If the parse contains productions from outside 
our grammar the machine has no direct method for 
handling them (e.g. the resulting database query 
may be syntactically malformed). 

We have developed a general approach for combin- 
ing parsers when preserving the entire structure of 
a parse tree is important. The combining algorithm 
is presented with the candidate parses and asked to 
choose which one is best. The combining technique 
must act as a multi-position switch indicating which 
parser should be trusted for the particular sentence. 
We call this approach parser switching. Once again 
we present both a non-parametric and a parametric 
technique for this task. 

2.2.1 Similarity Switching 

First we present the non-parametric version of parser 
switching, similarity switching: 

1. From each candidate parse, m, for a sentence, 
create the constituent set Si by converting each 
constituent into its tuple representation. 

2. The score for m is ^ \Sj H Si\, where j ranges 

over the candidate parses for the sentence. 

3. Switch to (use) the parser with the highest score 
for the sentence. Ties are broken arbitrarily. 

The intuition for this technique is that we can 
measure a similarity between parses by counting the 



constituents they have in common. We pick the 
parse that is most similar to the other parses by 
choosing the one with the highest sum of pairwise 
similarities. This is the parse that is closest to the 
centroid of the observed parses under the similarity 
metric. 

2.2.2 Naive Bayes 

The probabilistic version of this procedure is 
straightforward. We once again assume indepen- 
dence among our various member parsers. Further- 
more, we know one of the original parses will be the 
hypothesized parse, so the direct method of deter- 
mining which one is best is to compute the proba- 
bility of each of the candidate parses using the prob- 
abilistic model we developed in Section 2.1. We 



model each parse as the decisions made to create 
it, and model those decisions as independent events. 
Each decision determines the inclusion or exclusion 
of a candidate constituent. The set of candidate 
constituents comes from the union of all the con- 
stituents suggested by the member parsers. This 
is summarized in Equation [|. The computation of 
P(iri(c)\Mi . . . Mk(c)) has been sketched before in 
Equations [j] through ||. In this case we are inter- 
ested in finding the maximum probability parse, 7Tj, 
and Mi is the set of relevant (binary) parsing deci- 
sions made by parser i. ^ is a parse selected from 
among the outputs of the individual parsers. It is 
chosen such that the decisions it made in including 
or excluding constituents are most probable under 
the models for all of the parsers. 

argmaxP(7r,|Mi . . . M&) 

= argmaxJJP(7r i (c)|Mi(c)...M fc (c)) (5) 

C 

3 Experiments 

The three parsers were trained and tuned by their 
creators on various sections of the WSJ portion of 
the Penn Treebank, leaving only sections 22 and 23 
completely untouched during the development of any 
of the parsers. We used section 23 as the develop- 
ment set for our combining techniques, and section 
22 only for final testing. The development set con- 
tained 44088 constituents in 2416 sentences and the 
test set contained 30691 constituents in 1699 sen- 
tences. A sentence was withheld from section 22 
because its extreme length was troublesome for a 
couple of the parsers^] 

The standard measures for evaluating Penn Tree- 
bank parsing performance are precision and recall of 
the predicted constituents. Each parse is converted 
into a set of constituents represented as a tuples: 

1 The sentence in question was more than 100 words in 
length and included nested quotes and parenthetical expres- 
sions. 



(label, start, end). The set is then compared with 
the set generated from the Penn Treebank parse to 
determine the precision and recall. Precision is the 
portion of hypothesized constituents that are cor- 
rect and recall is the portion of the Treebank con- 
stituents that are hypothesized. 

For our experiments we also report the mean of 
precision and recall, which we denote by (P + R)/2 
and F-measure. F-measure is the harmonic mean of 
precision and recall, 2PR/(P + R). It is closer to 
the smaller value of precision and recall when there 
is a large skew in their values. 

We performed three experiments to evaluate our 
techniques. The first shows how constituent features 
and context do not help in deciding which parser 
to trust. We then show that the combining tech- 
niques presented above give better parsing accuracy 
than any of the individual parsers. Finally we show 
the combining techniques degrade very little when a 
poor parser is added to the set. 

3.1 Context 

It is possible one could produce better models by in- 
troducing features describing constituents and their 
contexts because one parser could be much better 
than the majority of the others in particular situa- 
tions. For example, one parser could be more ac- 
curate at predicting noun phrases than the other 
parsers. None of the models we have presented uti- 
lize features associated with a particular constituent 
(i.e. the label, span, parent label, etc.) to influence 
parser preference. This is not an oversight. Fea- 
tures and context were initially introduced into the 
models, but they refused to offer any gains in per- 
formance. While we cannot prove there are no such 
useful features on which one should condition trust, 
we can give some insight into why the features we 
explored offered no gain. 

Because we are working with only three parsers, 
the only situation in which context will help us is 
when it can indicate we should choose to believe a 
single parser that disagrees with the majority hy- 
pothesis instead of the majority hypothesis itself. 
This is the only important case, because otherwise 
the simple majority combining technique would pick 
the correct constituent. One side of the decision 
making process is when we choose to believe a con- 
stituent should be in the parse, even though only 
one parser suggests it. We call such a constituent an 
isolated constituent. If we were working with more 
than three parsers we could investigate minority con- 
stituents, those constituents that are suggested by 
at least one parser, but which the majority of the 
parsers do not suggest. 

Adding the isolated constituents to our hypothe- 
sis parse could increase our expected recall, but in 
the cases we investigated it would invariably hurt 
our precision more than we would gain on recall. 



Consider for a set of constituents the isolated con- 
stituent precision parser metric, the portion of iso- 
lated constituents that are correctly hypothesized. 
When this metric is less than 0.5, we expect to in- 
cur more errors^ than we will remove by adding those 
constituents to the parse. 

We show the results of three of the experiments we 
conducted to measure isolated constituent precision 
under various partitioning schemes. In Table |] we 
see with very few exceptions that the isolated con- 
stituent precision is less than 0.5 when we use the 
constituent label as a feature. The counts represent 
portions of the approximately 44000 constituents hy- 
pothesized by the parsers in the development set. 
In the cases where isolated constituent precision is 
larger than 0.5 the affected portion of the hypotheses 
is negligible. 

Similarly Figures [j] and || show how the iso- 
lated constituent precision varies by sentence length 
and the size of the span of the hypothesized con- 
stituent. In each figure the upper graph shows the 
isolated constituent precision and the bottom graph 
shows the corresponding number of hypothesized 
constituents. Again we notice that the isolated con- 
stituent precision is larger than 0.5 only in those 
partitions that contain very few samples. From this 
we see that a finer-grained model for parser combi- 
nation, at least for the features we have examined, 
will not give us any additional power. 

3.2 Performance Testing 

The results in Table || were achieved on the develop- 
ment set. The first two rows of the table are base- 
lines. The first row represents the average accuracy 
of the three parsers we combine. The second row 
is the accuracy of the best of the three parsers.^ 
The next two rows are results of oracle experiments. 
The parser switching oracle is the upper bound on 
the accuracy that can be achieved on this set in the 
parser switching framework. It is the performance 
we could achieve if an omniscient observer told us 
which parser to pick for each of the sentences. The 
maximum precision row is the upper bound on accu- 
racy if we could pick exactly the correct constituents 
from among the constituents suggested by the three 
parsers. Another way to interpret this is that less 
than 5% of the correct constituents are missing from 
the hypotheses generated by the union of the three 
parsers. The maximum precision oracle is an upper 
bound on the possible gain we can achieve by parse 
hybridization. 

2 This is in absolute terms, total errors being the sum of 
precision errors and recall errors. 

3 The identity of this parser is not given, nor is the iden- 
tity disclosed for the results of any of the individual parsers. 
We do not aim to compare the performance of the individual 
parsers, nor do we want to bias further research by giving the 
individual parser results for the test set. 



Parser 


Sentences 


% 


Parser 1 


279 


16 


Parser 2 


216 


13 


Parser 3 


1204 


71 



Table 4: Bayes Switching Parser Usage 



We do not show the numbers for the Bayes models 
in Table || because the parameters involved were es- 
tablished using this set. The precision and recall of 
similarity switching and constituent voting are both 
significantly better than the best individual parser, 
and constituent voting is significantly better than 
parser switching in precision.^ Constituent voting 
gives the highest accuracy for parsing the Penn Tree- 
bank reported to date. 

Table || contains the results for evaluating our sys- 
tems on the test set (section 22). All of these systems 
were run on data that was not seen during their de- 
velopment. The difference in precision between sim- 
ilarity and Bayes switching techniques is significant, 
but the difference in recall is not. This is the first 
set that gives us a fair evaluation of the Bayes mod- 
els, and the Bayes switching model performs signif- 
icantly better than its non-parametric counterpart. 
The constituent voting and naive Bayes techniques 
are equivalent because the parameters learned in the 
training set did not sufficiently discriminate between 
the three parsers. 

Table § shows how much the Bayes switching tech- 
nique uses each of the parsers on the test set. Parser 
3, the most accurate parser, was chosen 71% of the 
time, and Parser 1, the least accurate parser was cho- 
sen 16% of the time. Ties are rare in Bayes switch- 
ing because the models are fine-grained - many es- 
timated probabilities are involved in each decision. 

3.3 Robustness Testing 

In the interest of testing the robustness of these com- 
bining techniques, we added a fourth, simple non- 
lexicalized PCFG parser. The PCFG was trained 
from the same sections of the Penn Treebank as the 
other three parsers. It was then tested on section 
22 of the Treebank in conjunction with the other 
parsers. 

The results of this experiment can be seen in Ta- 
ble 0. The entries in this table can be compared with 
those of Table || to see how the performance of the 
combining techniques degrades in the presence of an 
inferior parser. As seen by the drop in average indi- 
vidual parser performance baseline, the introduced 
parser does not perform very well. The average in- 
dividual parser accuracy was reduced by more than 
5% when we added this new parser, but the preci- 

4 All significance claims are made based on a binomial hy- 
pothesis test of equality with an a < 0.01 confidence level. 



Constituent 


Parserl 


Parser2 


Parser3 


Label 


count 


Precision 


count 


Precision 


count 


Precision 


ADJP 


132 


28.78 


215 


21.86 


173 


34.10 


ADVP 


150 


25.33 


129 


21.70 


102 


31.37 


CONJP 


2 


50.00 


8 


37.50 


3 


0.00 


FRAG 


51 


3.92 


29 


27.58 


11 


9.09 


INTJ 


3 


66.66 


1 


100.00 


2 


50.00 


LST 





NA 





NA 





NA 


NAC 





NA 


13 


53.84 


7 


14.28 


NP 


1489 


21.08 


1550 


18.38 


1178 


27.33 


NX 


7 


85.71 


9 


22.22 


3 


0.00 


PP 


732 


23.63 


643 


20.06 


503 


27.83 


PRN 


20 


55.00 


33 


54.54 


38 


15.78 


PRT 


12 


16.66 


20 


40.00 


16 


37.50 


QP 


21 


38.09 


34 


44.11 


76 


14.47 


RRC 


1 


0.00 


1 


0.00 


2 


0.00 


S 


757 


13.73 


482 


23.65 


434 


38.94 


SBAR 


331 


11.78 


196 


23.97 


178 


34.83 


SBARQ 





NA 


6 


16.66 


3 


0.00 


SINV 


3 


66.66 


11 


81.81 


13 


30.76 


SQ 


2 





11 


18.18 


3 


33.33 


UCP 


6 


16.66 


12 


8.33 


8 


12.50 


VP 


868 


13.36 


630 


24.12 


477 


35.42 


WHADJP 





NA 





NA 


1 


0.00 


WHADVP 


2 


100.00 


5 


40.00 


1 


100.00 


WHNP 


33 


33.33 


8 


25.00 


17 


58.82 


WHPP 





NA 





NA 


2 


100.00 


X 





NA 


2 


100.00 


1 


0.00 



Table 1: Isolated Constituent Precision By Constituent Label 



Reference / System 


P R 


(P+R)/2 F 


Average Individual Parser 
Best Individual Parser 


87.14 86.91 
88.73 88.54 


87.02 87.02 
88.63 88.63 


Parser Switching Oracle 
Maximum Precision Oracle 


93.12 92.84 
100.00 95.41 


92.98 92.98 
97.70 97.65 


Similarity Switching 


89.50 89.88 


89.69 89.69 


Constituent Voting 


92.09 89.18 


90.64 90.61 


Table 2: Summary of Development Set Performance 


Reference / System 


P R 


(P+R)/2 F 


Average Individual Parser 
Best Individual Parser 


87.61 87.83 
89.61 89.73 


87.72 87.72 
89.67 89.67 


Parser Switching Oracle 
Maximum Precision Oracle 


93.78 93.87 
100.00 95.91 


93.82 93.82 
97.95 97.91 


Similarity Switching 
Bayes Switching 


90.04 90.81 
90.78 90.70 


90.43 90.43 
90.74 90.74 


Constituent Voting 
Naive Bayes 


92.42 90.10 
92.42 90.10 


91.26 91.25 
91.26 91.25 



Table 3: Test Set Results 
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Figure 1: Isolated Constituent Precision and Sentence Length 



sion of the constituent voting technique was the only 
result that decreased significantly. The Bayes mod- 
els were able to achieve significantly higher preci- 
sion than their non-parametric counterparts. We see 
from these results that the behavior of the paramet- 
ric techniques are robust in the presence of a poor 
parser. Surprisingly, the non-parametric switching 



technique also exhibited robust behaviour in this sit- 
uation. 

4 Conclusion 

We have presented two general approaches to study- 
ing parser combination: parser switching and parse 
hybridization. For each experiment we gave an non- 



100 



60 



40 



20 









Parser 1 i— 

Parser 2 — x- 
Parser 3 





1600 



1400 



1200 



1000 



800 



600 



400 - 



200 



10 



20 



30 



60 




Parser 1 i 

Parser 2 — x--- 
Parser3 



10 



20 30 40 

Constituent Span Length (Tokens) 



50 



60 



Figure 2: Isolated Constituent Precision and Span Length 



parametric and a parametric technique for combin- 
ing parsers. All four of the techniques studied result 
in parsing systems that perform better than any pre- 
viously reported. Both of the switching techniques, 
as well as the parametric hybridization technique 
were also shown to be robust when a poor parser was 
introduced into the experiments. Through parser 



combination we have reduced the precision error rate 
by 30% and the recall error rate by 6% compared to 
the best previously published result. 

Combining multiple highly-accurate independent 
parsers yields promising results. We plan to explore 
more powerful techniques for exploiting the diversity 
of parsing methods. 



Reference / System 


P R 


(P+R)/2 F 


Average Individual Parser 
Best Individual Parser 


84.55 80.91 
89.61 89.73 


82.73 82.69 
89.67 89.67 


Parser Switching Oracle 
Maximum Precision Oracle 


93.92 93.88 
100.00 96.66 


93.90 93.90 
98.33 98.30 


Similarity Switching 
Bayes Switching 


89.90 90.89 
90.94 90.70 


90.40 90.39 
90.82 90.82 


Constituent Voting 
Naive Bayes 


89.78 91.80 
92.42 90.10 


90.79 90.78 
91.26 91.25 



Table 5: Robustness Test Results 
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